Working with Data

Tidy Data

Expected layout of “tidy” datasets

There’s Data

Gender stereotypes in 5- to 7-year-old children


subject  sex     age  trait  target    stereotype  high_achieve_caution
116      male    5    smart  adults    0.00        0.50
139      male    7    smart  children  0.75        0.75
12       female  6    smart  adults    0.25        0.75
140      male    5    nice   children  0.50        0.25
76       female  5    nice   adults    1.00        1.00
10       male    5    smart  adults    1.00        0.50
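
As a minimal sketch of the tidy layout, the first three rows of the table above could be stored in R as a data frame. The object name stereotypes is hypothetical and base R's data.frame() is an assumption here; the point is only that each row is one observation and each column is one variable.

```r
# Hypothetical object holding the first three rows of the table above.
stereotypes <- data.frame(
  subject = c(116, 139, 12),
  sex = c("male", "male", "female"),
  age = c(5, 7, 6),
  trait = c("smart", "smart", "smart"),
  target = c("adults", "children", "adults"),
  stereotype = c(0.00, 0.75, 0.25),
  high_achieve_caution = c(0.50, 0.75, 0.75)
)

# One row per observation, one column per variable.
dim(stereotypes)
str(stereotypes)
```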

Lots of Data

Body girth measurements and skeletal diameter measurements for 247 men and 260 women.

age  wgt   hgt    sex  sho_gi  wai_gi  nav_gi  hip_gi
46   84.5  181.6  1    111.5   91.6    102.1   106.7
20   56.4  163.2  0    105.1   66.1    73.2    91.8
37   75.5  177.8  1    116.7   75.9    77.0    93.4
29   81.8  182.9  0    108.9   83.6    108.6   108.3
62   64.6  167.0  1    104.0   76.0    83.0    93.0
43   87.3  188.0  1    119.0   97.8    93.6    102.5

In Every Context

NBA player of the week from 1985 to 2016


Age  Date          Draft Year  Height  Player         Position
22   Jan 25, 2004  2002        6-9     Carlos Boozer  PF
27   Dec 7, 1986   1981        6-10    Tom Chambers   PF
27   Feb 8, 2004   1997        6-11    Tim Duncan     FC
33   Dec 20, 2010  1998        6-7     Paul Pierce    SF
27   Dec 3, 2006   2000        6-6     Michael Redd   G
27   Jan 1, 2006   1999        6-7     Shawn Marion   F

You Can Imagine

Fish Sampled on Blackfoot River


trip  mark  length  weight  year  section      species
1     0     186     80      2004  ScottyBrown  RBT
1     0     142     25      2004  ScottyBrown  RBT
1     0     203     80      1996  Johnsrud     RBT
2     0     113     25      1990  ScottyBrown  RBT
2     1     369     NA      2002  Johnsrud     Brown
1     0     192     60      1993  Johnsrud     RBT

Your Turn

Every year, the US publicly releases a large dataset containing information on births recorded in the country.

A total of 13 variables were collected on every birth, including information about:

  • the birth (baby weight, sex of baby, premie status)
  • the pregnancy (hospital visits, length of gestation)
  • the mother’s attributes (age, smoking status, marital status, race)
  • the father’s age

How would you expect this dataframe to look?

Types of Variables


Diagram of types of variables we will analyze!

Examples

  • A person’s height (usually) would be a continuous, numerical variable

  • The number of classes someone takes would be a discrete, numerical variable

  • A course letter grade would be an ordinal, categorical variable

  • The color of someone’s hair would be a nominal (unordered), categorical variable
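
One way these four kinds of variables might be represented in R is sketched below. All values are made up for illustration, and the choice of base R vectors and factors is an assumption rather than something the slides prescribe.

```r
# Illustrative values only; not taken from any dataset in these slides.
height_cm <- c(162.5, 180.1, 171.3)         # continuous numerical
n_classes <- c(3L, 4L, 5L)                  # discrete numerical (integers)
grade <- factor(c("B", "A", "C"),
                levels = c("C", "B", "A"),
                ordered = TRUE)             # ordinal categorical
hair <- factor(c("brown", "black", "red"))  # unordered categorical

is.ordered(grade)    # TRUE: the levels have a meaningful order
is.ordered(hair)     # FALSE: the levels are just labels
grade[1] < grade[2]  # TRUE: "B" ranks below "A" in the levels given
```

Comparisons such as < only make sense for the ordered factor; R returns NA with a warning if they are attempted on an unordered one.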

Your Turn


Suppose researchers have yearly data on elephant seal abundance at Piedras Blancas from 2010 to 2014.


What type of variable would year be?

Types of Studies

Experiment

  • randomization
  • replication
  • controlling
  • blocking
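
A rough sketch of how randomization, replication, and blocking could look in code follows. The study, the data frame subjects, and its columns are all hypothetical: eight subjects are blocked by sex, and within each block half are randomly assigned to treatment and half to control.

```r
set.seed(1)  # arbitrary seed so the random assignment is reproducible

# Hypothetical study: 8 subjects, blocked by sex.
subjects <- data.frame(
  id = 1:8,
  sex = rep(c("female", "male"), each = 4)
)

# Randomize separately within each block.
subjects$group <- NA_character_
for (block in unique(subjects$sex)) {
  in_block <- subjects$sex == block
  labels <- rep(c("treatment", "control"), length.out = sum(in_block))
  subjects$group[in_block] <- sample(labels)  # random order of the labels
}

# Each block contributes two subjects to each group (replication).
table(subjects$sex, subjects$group)
```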

Observational Study

  • collect data in a way that does not directly interfere with how the data arise

Relationships Between Variables


explanatory variable \(\rightarrow\) might affect \(\rightarrow\) response variable

  • If two variables are not associated, then they are said to be independent.

  • If two variables are associated, then they are said to be dependent.
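
To make the distinction concrete, here is a small sketch with two made-up tables of counts. The numbers are hypothetical and ignore sampling variability; in real data the proportions would rarely match exactly even for independent variables.

```r
# Hypothetical counts, made up for illustration.
# Rows: explanatory variable (group); columns: response variable (outcome).
no_assoc <- matrix(c(30, 70,
                     30, 70), nrow = 2, byrow = TRUE,
                   dimnames = list(group = c("A", "B"),
                                   outcome = c("yes", "no")))
assoc <- matrix(c(30, 70,
                  60, 40), nrow = 2, byrow = TRUE,
                dimnames = list(group = c("A", "B"),
                                outcome = c("yes", "no")))

# Row proportions: the share of each outcome within each group.
prop.table(no_assoc, margin = 1)  # identical rows: no association
prop.table(assoc, margin = 1)     # rows differ: the variables are associated
```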

Causal Inference


association \(\neq\) causation


What do you need to say that the explanatory variable causes a change in the response variable?

Lab Warm-up

Data Types in R

glimpse(births_small)
Rows: 1,000
Columns: 10
$ fage           <int> 34, 36, 37, NA, 32, 32, 37, 29, 30, 29, 30, 34, 28, 28,…
$ mage           <dbl> 34, 31, 36, 16, 31, 26, 36, 24, 32, 26, 34, 27, 22, 31,…
$ weeks          <dbl> 37, 41, 37, 38, 36, 39, 36, 40, 39, 39, 42, 40, 40, 39,…
$ premie         <chr> "full term", "full term", "full term", "full term", "pr…
$ gained         <dbl> 28, 41, 28, 29, 48, 45, 20, 65, 25, 22, 40, 30, 31, NA,…
$ weight         <dbl> 6.96, 8.86, 7.51, 6.19, 6.75, 6.69, 6.13, 6.74, 8.94, 9…
$ lowbirthweight <chr> "not low", "not low", "not low", "not low", "not low", …
$ sex            <fct> male, female, female, male, female, female, female, mal…
$ habit          <chr> "nonsmoker", "nonsmoker", "nonsmoker", "nonsmoker", "no…
$ whitemom       <chr> "white", "white", "not white", "white", "white", "white…

What do you think dbl means?

How is that different from int?

What does chr mean?

How might it differ from fct?
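
A short sketch for exploring these questions in the console. The abbreviations in the glimpse() output stand for double, integer, character, and factor; the habit values below are made up for illustration rather than read from births_small.

```r
typeof(34L)   # "integer": whole numbers, written with an L suffix
typeof(6.96)  # "double": numbers that can carry a decimal part
typeof(34)    # "double": without the L suffix, R stores a double

# Illustrative values, not taken from births_small.
habit <- c("nonsmoker", "smoker", "nonsmoker")
class(habit)           # "character": plain text
habit_fct <- factor(habit)
class(habit_fct)       # "factor": text restricted to a fixed set of levels
levels(habit_fct)      # "nonsmoker" "smoker"
as.integer(habit_fct)  # 1 2 1: the integer codes stored behind the labels
```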